ABSTRACT

The community analysis algorithm proposed by Clauset, Newman, and
Moore (the CNM algorithm) finds community structure in social
networks. Unfortunately, the CNM algorithm does not scale well, and
its use is practically limited to networks of up to 500,000 nodes.
The paper identifies that this inefficiency is caused by merging
communities in an unbalanced manner. The paper introduces three
kinds of metrics (consolidation ratios) that control the process of
community analysis by trying to balance the sizes of the communities
being merged. Three flavors of the CNM algorithm are built
incorporating those metrics. The proposed techniques are tested
using data sets obtained from an existing social networking service
that hosts 5.5 million users. All the methods exhibit dramatic
improvement in execution efficiency in comparison with the original
CNM algorithm and show high scalability. The fastest method
processes a network with 1 million nodes in 5 minutes and a network
with 4 million nodes in 35 minutes. Another one processes a network
with 500,000 nodes in 50 minutes (7 times faster than the original
algorithm), finds community structures that have improved
modularity, and scales to a network with 5.5 million nodes.

Categories and Subject Descriptors 

H.2.8 [Database applications]: Data mining; G.2.2 [Graph
Theory]: Graph algorithms; H.3 [Information storage 
and retrieval]: Information networks 

Keywords 

Community analysis, clustering, social networking service 

1. INTRODUCTION

Research on complex networks attracts interest from a broad range of
scientific disciplines. Examples of complex networks include the
World Wide Web (WWW), citation networks, human activities on the
Internet (e.g., exchange of emails, social networking systems,
consumer behavior in e-commerce, and Weblog track-back networks),
physical phenomena, and biochemical networks, among many others.

Finding community structure in networks is an important first step
toward grasping the inherent complex structure of social networks.
Due to the ever-expanding use of digital networks, traces of global
human activities have become available in digital form. Many
research activities attempt to define the notion of communities and
propose community analysis algorithms [8, 7, 9, 4, 14, 10, 15, 11,
3, 13, 2, 12]. We implemented a fast community analysis algorithm
proposed by Clauset, Newman, and Moore [3] (the CNM algorithm) and
applied it to analyze various subsets of an acquaintance
relationship network obtained from a social networking system (SNS).
The algorithm performs well for a mid-scale subset of the network
that consists of fewer than 500,000 users. However, the algorithm
was unable to analyze larger networks.

We observed that merging communities of unbalanced sizes has a great
impact on the computational efficiency of the CNM algorithm. From
this observation, we expected that merging communities in a balanced
manner would improve the efficiency of the algorithm. In this paper,
we introduce the notion of consolidation ratio, which is a measure
of the balancedness of a community pair, and use it together with
modularity as the means to find the next pair of communities to
merge into a larger one.

The paper presents three types of consolidation ratio. Three flavors
of the CNM algorithm, each of which incorporates one of those
consolidation ratios, were built. They are implemented as
single-threaded Java programs and were tested using various subsets
of an SNS network that hosts 5.5 million users as data sets. The
fastest program finds community structure in a network of 1 million
nodes in 5 minutes. Computational efficiency and scalability of the
proposed algorithms, and the quality of the generated community
structures, are discussed in detail.

The structure of the paper is as follows: Section 2 compares our
work with other related research activities, Section 3 explains the
CNM algorithm and identifies the source of its performance
inefficiency, Section 4 introduces heuristics that make use of
consolidation ratio, Section 5 evaluates the proposal, and Section 6
concludes the paper.

2. RELATED WORK 

Analysis of community structures of social and cyber networks is an
effort to find cyber-communities. We believe that the
cyber-communities found this way support reasoning about the
structure, nature, and dynamics of real communities. Many community
analysis techniques have been proposed by researchers from a broad
range of disciplines. There are two types of algorithms designed for
this purpose. One type takes a graph and one or more seed nodes, and
gives a community structure that includes the seed nodes [8, 4, 14,
10]. This type of community analysis algorithm is widely used for
analysis of WWW link structure. In WWW link analysis, Web pages or
Web sites are modeled as nodes and hyperlinks are treated as edges,
forming a huge directed graph.

'HITS' algorithm [8, 7] proposed by Kleinberg focuses on 
two types of characteristic structures called authorities and 
hubs that are defined in mutually recursive manner. A Web 
page given a higher authority value is regarded as an author- 
itative page. It is referenced from many hub pages which in 
turn collect many links to authoritative pages. HITS algo- 
rithm assigns an authority value and a hub value to each 
Web page in an iterative process. Link structures formed 
by authorities and hubs can be understood as cores of inter- 
related community structures. 

Dean and Henzinger used HITS algorithm to build a new 
Web search engine called 'Companion' [4]. Unlike standard 
keyword-based search engines, Companion takes Web pages 
of interest for the user and performs a Web link analysis 
to find a set of Web pages whose contents are closely re- 
lated with each other. Toyoda and Kitsuregawa improved 
the performance of Companion's link analysis and proposed 
an improved version called 'Companion-'. Companion- vi- 
sually addresses internal structure of the Web community 
[14]. 

Another type of community analysis algorithm takes a graph and
divides it into a set of densely connected subgraphs [8, 9, 6, 5,
15, 11, 3, 2, 12]. Various notions of communities have been
proposed. Some work "defines" communities by the algorithm. Kumar
and others formulated the graph partition problem as finding minimum
complete bipartite subgraphs. Flake and others gave a concise
definition of cyber-communities based on a graph-theoretic
foundation [6, 5] and proved that community analysis falls into the
class of maximum-flow, minimum-cut problems. Newman and Girvan
proposed a measure called modularity, which is a quantitative
measure of the quality of a graph partitioning [11]. A fast
algorithm that finds a community structure in a bottom-up manner,
greedily maximizing modularity, was presented in [3]. Our research
is based on this work.

3. CNM ALGORITHM 

Newman and Girvan measure the quality of network clustering by means
of modularity [11]. The CNM algorithm is a bottom-up greedy
optimization that repeatedly finds and merges a pair of communities,
trying to maximize the modularity of the community structure [3].
This section briefly presents the notion of modularity, outlines the
CNM algorithm, and addresses its computational inefficiency.

3.1 Modularity 

The modularity of a network's community structure is a quantitative
measure of the quality of a clustering (i.e., a graph partitioned
into a set of subgraphs) [11]. It can be used to compare the quality
of different clusterings of the same network. It is desirable that
members of a community have dense intra-community links and a small
number of links connected to members of other communities. This idea
is embedded in the formulation of modularity, as explained below.

Let $G = (V, E)$ be an undirected graph that represents a social
network. For example, an acquaintance network of an SNS can be
represented by $(U, F)$, where $U$ is a set of users and $F$
represents friendship (if users $u_1$ and $u_2$ are friends, then
$(u_1, u_2) \in F$). The adjacency matrix $A$ is another way to
represent edges:

$$A_{vw} = \begin{cases} 1 & (v, w) \in E \\ 0 & \text{otherwise.} \end{cases}$$

It can be used to define the total number of edges
($m = \sum_{v,w \in V} A_{vw} / 2$) and the degree of a node $v$
($k_v = \sum_{w} A_{vw}$).

A clustering $C$ of $G$ into a set of communities is a partitioning
of the nodes $V$ into subsets:

$$C = \{c_1, c_2, \ldots\}, \quad c_i \cap c_j = \emptyset \ (i \neq j), \quad \bigcup_{c_i \in C} c_i = V$$

The proportion of edges that link members of communities $c_i$ and
$c_j$ in the whole graph is given by $e_{ij}$. Likewise, the
proportion of $c_i$'s edges in the whole graph is given by $a_i$:

$$e_{ij} = \sum_{v \in c_i,\, w \in c_j} A_{vw} / 2m, \qquad a_i = \sum_{v \in c_i} k_v / 2m.$$

The definition of modularity given below states that communities in
a good clustering of a graph $G$ have dense intra-community links and
fewer inter-community links:

$$Q(G, C) = \sum_{i} (e_{ii} - a_i^2).$$
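As a concrete reading of the definition, the following is a minimal Python sketch that evaluates $Q = \sum_i (e_{ii} - a_i^2)$ for a given clustering. The function name `modularity` and the edge-list input format are assumptions made for this illustration and are not part of the implementation described in this paper.

```python
from collections import defaultdict

def modularity(edges, community_of):
    """Return Q = sum_i (e_ii - a_i^2) for an undirected, unweighted graph.

    edges: list of (v, w) pairs, each undirected edge listed once
           (no self-loops assumed).
    community_of: dict mapping node -> community label.
    """
    m = len(edges)
    intra = defaultdict(int)    # edges with both ends inside the community
    degree = defaultdict(int)   # sum of node degrees per community
    for v, w in edges:
        cv, cw = community_of[v], community_of[w]
        degree[cv] += 1
        degree[cw] += 1
        if cv == cw:
            intra[cv] += 1
    q = 0.0
    for c in degree:
        e_cc = intra[c] / m          # each intra edge is counted twice in A, over 2m
        a_c = degree[c] / (2 * m)
        q += e_cc - a_c * a_c
    return q

# Two triangles joined by a single edge.
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
two_groups = {0: "x", 1: "x", 2: "x", 3: "y", 4: "y", 5: "y"}
one_group = {v: "all" for v in range(6)}
print(modularity(edges, two_groups), modularity(edges, one_group))
```

Splitting the example graph into its two triangles scores higher than putting every node into one community, which matches the intuition of dense intra-community links and few inter-community links.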

3.2 Algorithm 

Newman and Girvan presented a greedy community analysis algorithm
that optimizes modularity. Later, Clauset, Newman, and Moore
proposed a more efficient algorithm (the CNM algorithm) that works
the same as the former proposal in principle but incorporates
sophisticated data structures [3].

The algorithm starts from a totally unclustered situation, where
each node in the graph forms a singleton community. Then, for each
pair of communities, the expected improvement of modularity when
they merge is computed:

$$\Delta Q_{c_i,c_j} = Q(G, C - c_i - c_j + (c_i \cup c_j)) - Q(G, C).$$

The algorithm repeatedly chooses the community pair that gives the
maximum $\Delta Q$ value and merges it into a new community
(Algorithm 1). During the merge process, the $\Delta Q$ values of the
communities that adjoin the new community need to be updated.
Because the number of community pairs in the clustering decreases
monotonically, the algorithm eventually stops when no community
pairs remain to be merged.
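The greedy loop can be sketched naively in Python as follows. This is an illustration only: it scans all linked pairs at every step instead of using the heaps of [3], and it uses the closed form $\Delta Q_{c_i,c_j} = 2(e_{ij} - a_i a_j)$, which follows from the definition of $Q$ when $c_i$ and $c_j$ are replaced by their union. The function name `greedy_merge` and the input format are assumptions of this sketch.

```python
from collections import defaultdict

def greedy_merge(edges):
    """Naive greedy modularity maximization on an undirected graph.

    edges: list of (v, w) pairs with comparable node ids, no self-loops.
    Returns a dict: community id -> set of member nodes.
    """
    m = len(edges)
    nodes = sorted({v for edge in edges for v in edge})
    members = {v: {v} for v in nodes}   # every node starts as a singleton
    a = defaultdict(float)              # a_i
    e = defaultdict(float)              # e_ij for i != j, stored symmetrically
    for v, w in edges:
        a[v] += 1 / (2 * m)
        a[w] += 1 / (2 * m)
        e[(v, w)] += 1 / (2 * m)
        e[(w, v)] += 1 / (2 * m)
    while True:
        best, best_dq = None, float("-inf")
        for (i, j), e_ij in e.items():
            if i < j:
                dq = 2 * (e_ij - a[i] * a[j])   # delta Q of merging i and j
                if dq > best_dq:
                    best, best_dq = (i, j), dq
        if best is None or best_dq < 0:         # same stop rule as Algorithm 1
            break
        i, j = best
        members[i] |= members.pop(j)            # community j is absorbed by i
        a[i] += a.pop(j)
        merged = defaultdict(float)
        for (x, y), val in e.items():
            x2 = i if x == j else x
            y2 = i if y == j else y
            if x2 != y2:                        # drop links that became internal
                merged[(x2, y2)] += val
        e = merged
    return members

edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
print(greedy_merge(edges))
```

Only linked pairs are examined, since merging two communities with no edge between them gives $\Delta Q = -2 a_i a_j \le 0$.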

The CNM algorithm uses two data structures to find the community
pair with the maximum $\Delta Q$ value: (1) a balanced binary tree
(or heap tree) of community pairs $(c_i, c_j)$ and (2) a max heap
(or priority heap) of community pairs that is sorted by
$\Delta Q_{c_i,c_j}$. They achieve logarithmic computational cost for
removal and insertion of a community pair, and for finding the
community pair $(c_i, c_j)$ with the maximum $\Delta Q$ value for a
given $c_i$. For each community, the community pair with the maximum
$\Delta Q$ value is stored in a system-wide max heap.

By using these data structures, the search for the community pair
with the largest $\Delta Q$ value is performed in two stages. First,
each community searches its max heap for the pair with the largest
$\Delta Q$ among its community pairs and stores it in a system-wide
max heap that is used in the second stage.
Algorithm 1 An outline of the algorithm proposed by
Clauset et al. [3]

C := {{v} | v ∈ V};

function join(ci, cj) {
    return C − ci − cj + (ci ∪ cj);
}

procedure updateDeltaQ() {
    ∀ci, cj ∈ C.
        ΔQ[ci,cj] := Q(G, join(ci, cj)) − Q(G, C);
}

while (true) {
    updateDeltaQ();
    Find (ci, cj) ∈ C² that has maximum ΔQ[ci,cj].
    if (max(ΔQ[ci,cj]) < 0) break;
    C := join(ci, cj);
}




Figure 1: Analysis time required for networks with 
various scales (100K, 200K, . . . , 500K nodes). Each 
bar represents time required for merging 10,000 
community pairs. 



Elements in the system-wide max heap are candidates for the
community pair that has the system-wide maximum $\Delta Q$ value.
Once all the candidates are stored in the system-wide max heap, the
pair with the system-wide maximum $\Delta Q$ value can be found
easily.

Newman and Girvan showed that the update of $\Delta Q_{c_i,c_j}$ for
a community pair $(c_i, c_j)$ needs to be performed only when either
$c_i$ or $c_j$ merges. Also, the update of $\Delta Q_{c_i,c_j}$ is
simple arithmetic using its neighbors' past $\Delta Q$ values.
Clauset and others have applied this algorithm to several real-world
social networks, including a network of purchase transactions
offered by Amazon, which contains more than 400,000 nodes and 2
million edges.¹

3.3 Performance inefficiency 

The authors have programmed the CNM algorithm and attempted to
analyze an acquaintance network of an SNS called "mixi"² that hosted
about one million users in October 2005. The experiment was
performed on a PC (Intel Xeon 2.80GHz, L2 cache = 2MB, Memory =
4GB). However, in spite of the good scalability advertised in [3],
the authors found it impractical to analyze this mega-scale social
network using the CNM algorithm. The experiment was stopped after a
week, when less than 10% of the whole analysis was finished. Yuta
and others conducted a similar experiment on an earlier mixi network
on Linux running on a Pentium IV 2.8 GHz with 1GB memory and state
that community analysis of an SNS network of 360,000 users using the
CNM algorithm took six hours [16, 17].

To figure out the performance bottleneck of the CNM algorithm, we
conducted community analysis on various subsets of the mixi SNS
network. The mixi SNS gives each user an ID number starting from
"1", in the order of user registration. Therefore, the mixi SNS
network can be represented by a graph $G^{mixi} = (U, F)$, where
$U = \{1, 2, \ldots\}$ is the set of user IDs and
$F \subseteq U \times U$ is the set of acquaintance relationships,
namely $(i, j) \in F$ if and only if the two users identified by $i$
and $j$ are friends. We built a subset of the mixi acquaintance
graph, $G^{mixi}_n$, as follows:

$$G^{mixi}_n = (U(n), F \cap (U(n) \times U(n))) \quad \text{where } U(n) = \{u \in U \mid u < n\}$$

¹ http://www.amazon.com/

² mixi (http://mixi.jp/) is the largest invitation-based SNS
in Japan.

Figure 1 illustrates the time required for community analysis of
various subsets of the social network: $G^{mixi}_{100K}$,
$G^{mixi}_{200K}$, $G^{mixi}_{300K}$, $G^{mixi}_{400K}$, and
$G^{mixi}_{500K}$. Each bar of the graph depicts the time required
to perform 10,000 merges of community pairs. For example, in the
case of $G^{mixi}_{500K}$ (black bars), 427,794 merges are performed
and the third 10,000 merges took about 1,600 seconds.

For each data set, most of the computation time is consumed in the
first half of the merging process, and computation time decreases
dramatically in the latter half. For example, in the case of
$G^{mixi}_{500K}$, merging 10,000 communities takes less than 200
seconds after 250,000 communities are merged.

The gross area of each pattern is the elapsed time for the
respective subset of the network (the elapsed time for $G^{mixi}_n$
is compared with our proposal in Figure 5). In this experiment, we
can approximate the elapsed time for analysis of $G^{mixi}_n$ by
$T(n) \approx 1.5 \cdot 10^{-8} \times n^{2.13 \pm 0.104}$.
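As a rough sanity check, the fitted curve can be evaluated numerically. The sketch below reads the fit as $T(n) \approx 1.5 \cdot 10^{-8} \cdot n^{2.13}$ seconds, using only the central value of the exponent; that reading, and the function name `estimated_seconds`, are assumptions of this illustration.

```python
def estimated_seconds(n, coeff=1.5e-8, exponent=2.13):
    """Elapsed-time estimate T(n) = coeff * n ** exponent, in seconds.

    coeff and exponent are the central values of the fit quoted in the
    text; the error term on the exponent is ignored here.
    """
    return coeff * n ** exponent

hours_500k = estimated_seconds(500_000) / 3600
print(f"estimated time for n = 500,000: {hours_500k:.1f} hours")
```

For $n = 500{,}000$ this gives a value between five and six hours, which is of the same order as the running time of the original algorithm reported in Section 5.1.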

[3] estimates the computational complexity of the CNM algorithm to
be $O(md \log n)$, where $n$ and $m$ are the numbers of nodes and
edges, respectively, and $d$ is the height of the dendrogram³. It
also argues that in a sparse network $m$ and $d$ can be approximated
by $n$ and $\log n$, respectively, and that the computational
complexity will be $O(n \log^2 n)$ for social networks. This
argument contradicts the super-quadratic computational cost observed
in our experiment. Investigation of the structure of the dendrogram
suggests that $d \sim \log n$ does not hold for the analysis of the
mixi SNS network.

The authors then carefully examined the merge logs that record how
community pairs are merged into larger ones. The merge logs
suggested that, among a huge number of communities, only a small
portion grow fast, merging in many tiny communities. Because of this
phenomenon, a huge unbalanced dendrogram was constructed.

³ A dendrogram is a binary tree that represents the history of the
merge process. If a pair of communities $(c_i, c_j)$ is merged into
a new community $c_k$, the dendrogram for $c_k$ will be a binary
tree whose subtrees are the dendrograms for $c_i$ and $c_j$.

Figure 2: Consolidation ratio of each merge step
illustrated in a partially log-scale chart.

This phenomenon can be clearly seen in Figure 2, which presents how
unbalanced the merge steps are throughout the progress of community
analysis for $G^{mixi}_{500K}$. For this purpose, we define the
consolidation ratio of a community merge as follows:

$$\mathrm{ratio}(c_i, c_j) = \min(|c_i|/|c_j|, |c_j|/|c_i|).$$

Figure 2 plots, for the $n$-th merge step
$c_k := \mathrm{join}(c_i, c_j)$, the point
$(n, \mathrm{ratio}(c_i, c_j))$, where the size of a community
($|c|$) is measured in terms of the number of its links to other
communities. In this figure, we can see the growth of some eight
large communities in the first half of the community analysis. We
can conclude that unbalanced growth of large communities is the
primary cause of performance degradation when the CNM algorithm is
applied to our data set.

An unbalanced merging process makes the height of the dendrogram
grow more or less proportionally to its size and degrades the
computational efficiency to $O(n^2 \log n)$.
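The effect of balance on dendrogram height can be illustrated with a toy model. The sketch below merges $n$ singleton communities either pairwise in rounds (balanced) or one at a time into a single growing community (unbalanced). The cost model, in which one merge costs $|c_i| + |c_j|$, and the function name `merge_stats` are assumptions of this illustration, not measurements from our experiments.

```python
def merge_stats(n, balanced):
    """Merge n singleton communities into one.

    Returns (dendrogram_height, total_cost), where one merge of
    communities with sizes s1 and s2 is assumed to cost s1 + s2.
    """
    comms = [(1, 0)] * n          # (size, dendrogram height) per community
    total = 0
    if balanced:
        # Pair up neighbours round by round, like a tournament.
        while len(comms) > 1:
            nxt = []
            for k in range(0, len(comms) - 1, 2):
                (s1, h1), (s2, h2) = comms[k], comms[k + 1]
                total += s1 + s2
                nxt.append((s1 + s2, max(h1, h2) + 1))
            if len(comms) % 2:
                nxt.append(comms[-1])   # odd one out waits for the next round
            comms = nxt
    else:
        # One community keeps absorbing singletons.
        size, height = comms[0]
        for s, h in comms[1:]:
            total += size + s
            size, height = size + s, max(height, h) + 1
        comms = [(size, height)]
    return comms[0][1], total

print(merge_stats(1024, balanced=True))
print(merge_stats(1024, balanced=False))
```

With 1,024 singletons the balanced schedule yields a dendrogram of height 10, while the unbalanced schedule yields height 1,023 and a total cost that grows quadratically with the number of nodes.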

4. ALGORITHM 

In the previous section, we have seen the cause of the in- 
efficiency of CNM algorithm. In this section, we present a 
data structure and three types of heuristics that dramati- 
cally improve computational efficiency of CNM algorithm. 

4.1 Data structure 

In the CNM algorithm, heavy operations are performed when it
searches for the community pair that has the maximum $\Delta Q$
value and when it merges communities. We have replaced the balanced
binary trees and max heaps originally suggested in [3] by a
doubly-linked list that is sorted in the order of community ID.

Each community $c_i$ in our system has a data structure to store
references to neighboring communities, which is represented by a
list of pairs of communities (see Figure 3). The list is sorted in
the order of community ID. For example, a community $c_1$ that links
to communities $c_2$, $c_3$, $c_4$, $c_5$, ... is represented by a
community object that has a list of community pairs
$\{(1, 2), (1, 3), (1, 4), (1, 5), \ldots\}$. A community pair has
references to the communities it belongs to.



Figure 3: Our implementation of communities. A community maintains
links to its neighboring communities in a list of community pairs,
and a link to the pair that has the maximum $\Delta Q$ value.

Figure 4: Merge of $c_1$ and $c_5$ in Figure 3 produced a new
community $c_7$. During the merge, the community pairs of the merged
communities are combined, updating their $\Delta Q$ values.

For example, in Figure 3, community pair $(c_1, c_2)$ has links
pointing at communities $c_1$ and $c_2$. Merging two communities is
effectively a process of merging their community pairs, eliminating
duplicates and updating their $\Delta Q$ values. By the use of
sorted lists, merging can be accomplished in time linear in the
number of community pairs.
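The linear-time merge of two sorted pair lists can be sketched as follows. The list representation (tuples of neighbor ID and $\Delta Q$) and the function name `merge_pair_lists` are assumptions of this illustration. The $\Delta Q$ update rules are the ones given in [3]: a neighbor $k$ linked to both communities gets $\Delta Q_{ik} + \Delta Q_{jk}$, a neighbor linked only to $c_i$ gets $\Delta Q_{ik} - 2 a_j a_k$, and a neighbor linked only to $c_j$ gets $\Delta Q_{jk} - 2 a_i a_k$.

```python
def merge_pair_lists(i, j, pairs_i, pairs_j, a):
    """Merge the neighbor lists of communities i and j in one pass.

    pairs_i, pairs_j: lists of (neighbor_id, delta_q), sorted by neighbor_id.
    a: dict mapping community id -> a value.
    Returns the sorted pair list of the merged community.
    """
    out = []
    p = q = 0
    while p < len(pairs_i) or q < len(pairs_j):
        ki = pairs_i[p][0] if p < len(pairs_i) else None
        kj = pairs_j[q][0] if q < len(pairs_j) else None
        if kj is None or (ki is not None and ki < kj):
            # k is a neighbor of i only
            k, dq = ki, pairs_i[p][1] - 2 * a[j] * a[ki]
            p += 1
        elif ki is None or kj < ki:
            # k is a neighbor of j only
            k, dq = kj, pairs_j[q][1] - 2 * a[i] * a[kj]
            q += 1
        else:
            # k is a neighbor of both: the duplicate entry is eliminated
            k, dq = ki, pairs_i[p][1] + pairs_j[q][1]
            p += 1
            q += 1
        if k != i and k != j:   # the pair (i, j) itself disappears
            out.append((k, dq))
    return out

# Hypothetical numbers, chosen only to make the arithmetic easy to follow.
a = {1: 0.5, 2: 0.25, 3: 0.25, 5: 0.5, 6: 0.25}
pairs_1 = [(2, 0.5), (3, 0.25), (5, 0.125)]
pairs_5 = [(1, 0.125), (3, 0.25), (6, 0.5)]
print(merge_pair_lists(1, 5, pairs_1, pairs_5, a))
```

Because both inputs are sorted by community ID, each entry is visited once, so the cost is proportional to the combined length of the two lists.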

Similarly to [3], each community nominates its largest community
pair (the pair in its community pair list that has the largest
$\Delta Q$ value) to be stored in the system-wide max heap. This
technique allows for efficient retrieval of the maximum community
pair (the pair of communities that has the largest $\Delta Q$ value,
system-wide). For this purpose, each community maintains a link to
the largest community pair among the members of its list. Figure 3
marks the largest community pair of each community with a black star
(*) and shows the links to the largest community pairs as
"max $\Delta Q$ is" links. When two communities merge, the
"max $\Delta Q$ is" link for the new community can be found at no
extra cost, because we need to scan all the community pairs anyway
to merge them (Figure 4).

The use of the "max $\Delta Q$ is" link, however, introduces an
unpleasant problem. When communities $c_i$ and $c_j$ merge and the
$\Delta Q$ value of a community pair $p = (c_i, c_k)$ is updated, we
need to maintain the integrity of $c_k$ such that its
"max $\Delta Q$ is" link points to the truly largest community pair
in $c_k$'s list.

• If $p$ is not the largest community pair of $c_k$ (or, more
casually, $p$ is not marked by a black star) and its $\Delta Q$
value decreases, nothing is needed.

• If $p$ is not the largest community pair of $c_k$ and its
$\Delta Q$ value increases, we need to compare it with the
$\Delta Q$ value of $c_k$'s largest community pair. If the updated
value is larger, the "max $\Delta Q$ is" link is rearranged to point
to $p$ (or, more casually, we remove the black star from $c_k$'s
former largest community pair and put it on $p$).

• If $p$ is $c_k$'s largest community pair and its $\Delta Q$ value
increases, nothing is needed.

• (The worst case) If $p$ is $c_k$'s largest community pair and its
$\Delta Q$ value decreases, we do not have a convenient means to
tell whether it remains the largest or not. In this case, we scan
all the community pairs of $c_k$ and find the largest one.
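The four cases above can be sketched as a small Python class. The class name `Community`, its fields, and the dictionary used in place of the sorted linked list are simplifications assumed for this illustration; the `rescans` counter is only there to make the worst case visible.

```python
class Community:
    """Keeps a 'max delta Q is' link over a community's pairs."""

    def __init__(self, pairs):
        self.pairs = dict(pairs)   # neighbor id -> delta Q
        self.max_key = max(self.pairs, key=self.pairs.get)
        self.rescans = 0           # number of full scans (worst case)

    def update(self, key, new_dq):
        old_dq = self.pairs[key]
        self.pairs[key] = new_dq
        if key != self.max_key:
            if new_dq > old_dq and new_dq > self.pairs[self.max_key]:
                # not the largest, value increased past the current maximum:
                # move the link (the "black star") to this pair
                self.max_key = key
            # not the largest and value decreased: nothing to do
        elif new_dq < old_dq:
            # the largest pair decreased: scan every pair (worst case)
            self.max_key = max(self.pairs, key=self.pairs.get)
            self.rescans += 1
        # the largest pair increased: nothing to do

c = Community({2: 0.3, 3: 0.5, 4: 0.1})
c.update(4, 0.05)   # non-max decreases
c.update(2, 0.6)    # non-max increases past the maximum
c.update(2, 0.7)    # max increases
c.update(2, 0.2)    # max decreases, forcing a rescan
print(c.max_key, c.rescans)
```

Only the last update triggers a scan of the whole list; the other three are handled with a constant number of comparisons.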

The reader may fear a scenario where the last case is taken most of
the time. However, we believe this is not the case. The $\Delta Q$
value of a community pair depends on the number of neighboring
communities that the pair has. If the search process for community
structure follows the preferential attachment law [1], it is
expected that there exists a heavily linked pair in each community's
list and that its $\Delta Q$ is superior to those of the other
pairs. In such a situation it would be very difficult for the others
to compete with the largest community pair. If this optimistic
anticipation holds, the update of $\Delta Q$ is performed at unit
cost for each community pair.

In summary, arranging a set of community pairs in a list allows for
fast merging ($O(m)$ time), fast retrieval of the community pair
with the maximum $\Delta Q$ value ($O(1)$ time), and hopefully fast
updates of $\Delta Q$ values for the community pairs ($O(m)$ time),
where $m$ stands for the number of community pairs.

4.2 Heuristics based on consolidation ratio 

In Subsection 3.3, we have seen that the performance of the
algorithm degrades because of unbalanced growth of large
communities. If we could somehow control the growth of communities
so that they grow in a balanced manner, the performance of the
algorithm should improve remarkably. To turn this idea into
practice, we tested three flavors of the CNM algorithm that
incorporate heuristics based on three kinds of consolidation ratio.

Algorithm 2 Outline of the proposed algorithm. The
updateDeltaQ function remains the same as in Algorithm 1.

function ratio(ci, cj) {
    return min(|ci|/|cj|, |cj|/|ci|);
}

while (true) {
    updateDeltaQ();
    Find (ci, cj) ∈ C²
        that has maximum ΔQ[ci,cj] · ratio(ci, cj).
    if (max(ΔQ[ci,cj]) < 0) break;
    C := join(ci, cj);
}

The structure of the algorithm remains the same as Algorithm 1. The
only difference resides in the valuation basis of community pairs.
Algorithm 1 uses $\Delta Q_{c_i,c_j}$, while we use a combination of
both $\Delta Q_{c_i,c_j}$ and the consolidation ratio
($\mathrm{ratio}(c_i, c_j)$). This heuristics is designed so that it
suppresses unbalanced merges of communities and leads to balanced
growth of communities.

Figure 5: Comparison of Elapsed Time

So far we have not defined how we measure the size of a community
($|c_i|$). We have defined three different valuations of community
size and developed three kinds of heuristics.

The first heuristics (HE) measures the community size in terms of
its degree (i.e., the number of edges linked to its neighboring
communities, or the length of its list of community pairs). This
heuristics was derived from the fact that the cost of merging
communities is proportional to the number of their community pairs
(see Subsection 4.1).

The second heuristics (HE') was found accidentally when we were
trying to implement HE. As we have noted, the choice of the pair
with the largest $\Delta Q$ value is two-staged. For the first stage
(selection of a candidate community pair), HE' ignores the size of a
community and thus behaves the same as the CNM algorithm. On the
other hand, for the second stage, where the candidate pair with the
maximum $\Delta Q$ is searched for, it measures community size in
terms of its degree, like HE. This odd heuristics nevertheless works
faster than the CNM algorithm and also finds a better clustering
with respect to modularity.

The last heuristics (HN; labeled NE in Figure 6) measures the size
of a community in terms of the number of its members.
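The selection rule of Algorithm 2 with a pluggable size measure can be sketched as follows. The names `consolidation_ratio` and `pick_pair`, the candidate tuple layout, and the numbers in the example are hypothetical. Passing a degree lookup as `size` corresponds to HE, and passing a member-count lookup corresponds to HN; sizes are assumed to be positive.

```python
def consolidation_ratio(size_i, size_j):
    """min(|ci|/|cj|, |cj|/|ci|): 1.0 for equal sizes, near 0 when lopsided."""
    return min(size_i / size_j, size_j / size_i)

def pick_pair(candidates, size):
    """Choose the candidate that maximizes delta_q * consolidation ratio.

    candidates: iterable of (ci, cj, delta_q) tuples.
    size: function mapping a community id to a positive size.
    """
    return max(
        candidates,
        key=lambda t: t[2] * consolidation_ratio(size(t[0]), size(t[1])),
    )

# Hypothetical degrees: "a" is a huge hub-like community, "b" is tiny.
degree = {"a": 1000, "b": 2, "c": 10, "d": 8}
candidates = [("a", "b", 0.010), ("c", "d", 0.004)]

plain_cnm = max(candidates, key=lambda t: t[2])
balanced = pick_pair(candidates, degree.get)
print(plain_cnm, balanced)
```

In this example the plain $\Delta Q$ criterion picks the lopsided pair, while the weighted criterion prefers the pair of similar sizes even though its $\Delta Q$ is smaller.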

5. EVALUATION 

This section presents results obtained from running four 
flavors of CNM algorithm, the original one proposed in [3] 
and three variations of Algorithm 2 that incorporate our 
heuristics (namely, HE, HE', and HN). 

The four flavors of the CNM algorithm, including the original one,
are implemented on the Java platform: Java 5.0, Java HotSpot Server
VM (build 1.5.0_06 b-05) with 3.2GB heap size. The test was
performed on a PC (CPU = Intel Xeon 2.80GHz, L2 Cache = 2MB, RAM =
4GB) running Linux (Red Hat Linux version 2.6.16). Though the Xeon
comes with multiple cores, our Java program is single-threaded and
makes no use of parallelism.

5.1 Execution Efficiency 

Use of the heuristics dramatically accelerates execution of
community analysis. We have applied the four implementations to
analysis of the data sets $G^{mixi}_n$
($n \in \{50K, 100K, \ldots, 1000K\}$). Results are presented in
Figure 5 and Table 1. The largest data set that the original
algorithm (Clauset+ (2004)) was able to analyse is
$G^{mixi}_{500K}$. It took about 5.9 hours. The fastest heuristics
was HN. It processes $G^{mixi}_{1000K}$ in less than five minutes.
The other heuristics, HE and HE', process $G^{mixi}_{1000K}$ in
about 36 minutes and 3 hours, respectively. They are slower than HN
but are still practically usable, considering the size of the data
sets.

Table 1: Elapsed time (seconds)

5.2 Consolidation Ratio 

Improvement of the consolidation ratio of merged communities can
explain the speed-up that we have seen above. Figure 6 (a)-(c) shows
the consolidation ratios of merges of community pairs. In Figure 2,
we observed frequent unbalanced merges, especially in the first half
of community analysis. Consolidation ratios were some 1:1,000 to
1:10,000. In heuristics HN, the fastest one, consolidation ratios
are kept better than 1:100 for most of the analysis, and most of the
unbalanced merges are performed in the last stage of analysis.

We can observe a similar phase shift in heuristics HE, but the phase
shift starts earlier than in HN and the transition is more moderate.

In heuristics HE', it is difficult to observe the phase shift that
we have observed for HN and HE. Consolidation ratios degrade slowly
as community analysis progresses. As we will see shortly, this slow
degradation of consolidation ratio seems to be a key to retaining
higher modularity while achieving practical computational
efficiency.

As we mentioned earlier, we can observe the growth of several large
communities in the earlier stage of the original algorithm (see
Figure 2). In contrast, we can see many thin curves running from
upper-left to central-right in Figure 6-(c). This can be interpreted
as multiple communities of different sizes growing concurrently as
community analysis progresses. We believe that concurrent growth of
various communities gives a more natural explanation of the
community growth dynamics of a real SNS than sequential development
of large communities.

The impact of the heuristics on analysis time can clearly be seen in
Figure 7 (a)-(c). These charts present the time required for merging
10,000 community pairs. The patterns painted on the bars indicate
data sets of different scales ($G^{mixi}_n$,
$n \in \{200K, 400K, 600K, 800K, 1000K\}$).

Unlike in Figure 1, computation cost is kept much lower up to the
point where it increases steeply. The black bars stand for an
experiment performed using $G^{mixi}_{1000K}$. In this experiment,
heuristics HN merges 10,000 communities in less than 7 seconds for
the first 760K merges among 870K total merges. It processes the
heaviest part of the computation in less than 25 seconds per 10,000
merges, which is much smaller than the heaviest computation cost
incurred by the other heuristics, not to mention the original
algorithm.




Figure 6: Consolidation ratio observed during analysis of
$G^{mixi}_{500K}$. Panels: (a) HE (#edge ratio), (b) HE' (#edge
ratio with a bug), (c) NE (#node ratio). In each panel the
horizontal axis is the number of joins (0 to 450,000) and the
vertical axis is the consolidation ratio in log scale.
Figure 8: Modularity of community structures resulting from
community analysis performed on various scales.

Figure 7: Analysis time required for networks of various scales.
Each bar represents time required for merging 10,000 community
pairs.

The HE heuristics merges 10,000 communities in less than 5 seconds
for the first 560K merges among 870K total merges. In the
computationally heavy part, it takes 60-130 seconds per 10,000
merges.

The merge cost of the HE' heuristics is much higher than those of HE
and HN. In the computationally heavy part, it takes 100-650 seconds
per 10,000 merges.

5.3 Modularity 

It is our concern that use of the heuristics may reduce the
modularity of the resulting community structure. Figure 8 and Table
2 present the modularity of the community structures obtained from
the experiments. The vertical axis is $Q \cdot m^2$, where $Q$ is
modularity as defined in [3] and $m$ is the number of edges in the
graph. In our implementation (including the implementation of the
original algorithm), we use $\Delta Q \cdot m^2$ instead of
$\Delta Q$ because the former takes integer values and allows us to
replace costly floating-point arithmetic by cheaper integer
arithmetic.

To our surprise, HE' performs slightly better than the original
algorithm. The original algorithm attempts to optimize $\Delta Q$
alone, but it is known that greedy optimization does not necessarily
lead to a fully optimized result. Heuristics HE' demonstrates that
the CNM algorithm can be improved in both speed and modularity. It
processes the $G^{mixi}_{500K}$ data set 7 times faster, improves
modularity by 8-11%, and can process much larger data sets that the
original proposal is unable to process.

Heuristics HN performs slightly better in speed than HE, but the
community structures they produce exhibit rather poor modularity:
it is lower than the modularity resulting from the original
algorithm by 21-28%.

It is interesting to see how modularity improves as the community
analysis progresses (Figure 9). The horizontal axis is normalized to
the elapsed time of each community analysis.

Figure 9: Growth of modularity as community analysis progresses. The
data set used is $G^{mixi}_{500K}$.

Figure 11: Scalability

Figure 10: Sizes of communities: both axes are in log scale.

In the original algorithm, modularity improves gradually. Although
it attempts to greedily optimize modularity, the modularities of the
community structures computed by using the heuristics are superior
during the first half of the computation. This chart also suggests
that greedy optimization does not successfully optimize modularity.

Heuristics HE' demonstrates steep growth of modularity in the very
early stage, and modularity grows steadily up to the end of
analysis. The growth of HN is similar to that of the original
algorithm. In HE, modularity grows rather steeply but its growth
almost stops shortly afterwards. One possible interpretation is that
HE forms the core structure in its earlier stage, so that we could
stop community analysis at an early stage and obtain an
approximation of the community structure.

So far we have mainly discussed the quality of community clusterings
in terms of their modularities as defined in [11]. It is also
important to compare the structures produced by the four flavors of
the CNM algorithm. Figure 10 depicts a histogram of community size
in a log-scale chart. All methods find a few large (> 10,000)
communities and a lot of small (< 10) ones. They also find almost no
middle-sized communities. The original algorithm finds larger
communities (> 20,000 members) than our heuristics.

An important question to answer is whether there is a correspondence
between the communities found using different flavors of the CNM
algorithm. If there is not, the reliability of the results produced
by the CNM algorithm may need to be reconsidered. At this moment,
this remains an open question.

5.4 Scalability 

Figure 11 is obtained by applying the proposed heuristics to larger
data sets, ranging from 1M nodes up to 5.5M nodes. HE and HN
demonstrate almost linear scaling. The scalability of HE', on the
other hand, slowly declines, but we estimate that it is applicable
to networks that have up to 10M nodes. Scalability of the algorithm
is bound by the memory size of a standard PC. HE and HN failed to
process a network that consists of 5.5M nodes due to lack of
physical memory.

Our current implementations of the CNM algorithm are not optimized
for reduction of memory usage. We plan to re-implement them to make
better use of memory. We hope to be able to analyse larger networks
with 10M nodes soon. Further acceleration of the algorithm requires
use of parallelism.

6. SUMMARY 

The paper identified a bottleneck of the community analysis
algorithm proposed by Clauset, Newman, and Moore [3]. Its
inefficiency was caused by unbalanced structuring of communities.
The paper proposes three heuristics that attempt to balance the
sizes of the communities being merged. We have removed the
bottleneck and successfully obtained community structures of
large-scale social networks that contain over 5,000,000 nodes. Our
approach is scalable. It is expected to scale to an SNS network that
contains 10,000,000 nodes.

There still remain interesting unanswered issues. How do community
structures found by different algorithms relate to each other? How
do algorithmically found cyber-community structures relate to human
communities? Is it possible to explain the dynamics of SNS community
growth in terms of the progress of community analysis?

From a technical standpoint, we are interested in how much faster
and how scalable our proposals can be. We are interested in
parallelization of community analysis. The impact of our research on
middle-scale social networks is large. Our research has made it
possible to analyse a middle-scale social network (with 100,000
nodes) in a few minutes on a standard laptop computer, and we are
freed from waiting for hours or days for a response from community
analysis performed on a server.

We are currently working on visual presentation of clus- 
ter structures with Dr. Hiroshi Hosobe and Mr. Minato 
Koshida. We are also working on analysis of cyber-communities 
found in social networking services and their dynamics. 